This is a course on Open Data Science.
GitHub
My GitHub repository can be found here.
Load GGally (and ggplot2), set the working directory to the IODS folder and read in the local tab-separated copy of the file.
library(GGally)
## Loading required package: ggplot2
setwd("~/Work/IODS-project")
analysis2014 <- read.table("data/analysis2014.txt", sep="\t", header=TRUE)
str(analysis2014)
## 'data.frame': 166 obs. of 7 variables:
## $ Age : int 53 55 49 53 49 38 50 37 37 42 ...
## $ Attitude: int 37 31 25 35 37 38 35 29 38 21 ...
## $ Points : int 25 12 24 10 22 21 21 31 24 26 ...
## $ gender : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
## $ deep : num 3.58 2.92 3.5 3.5 3.67 ...
## $ surf : num 2.58 3.17 2.25 2.25 2.83 ...
## $ stra : num 3.38 2.75 3.62 3.12 3.62 ...
dim(analysis2014)
## [1] 166 7
The data has 166 rows and 7 columns. The columns show the age, gender, attitude and exam points of the 166 students.
In addition, the data frame has three combined scores for the deep, surface and strategic learning approaches.
ggpairs(analysis2014, mapping = aes(col=gender, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
summary(analysis2014)
## Age Attitude Points gender deep
## Min. :17.00 Min. :14.00 Min. : 7.00 F:110 Min. :1.583
## 1st Qu.:21.00 1st Qu.:26.00 1st Qu.:19.00 M: 56 1st Qu.:3.333
## Median :22.00 Median :32.00 Median :23.00 Median :3.667
## Mean :25.51 Mean :31.43 Mean :22.72 Mean :3.680
## 3rd Qu.:27.00 3rd Qu.:37.00 3rd Qu.:27.75 3rd Qu.:4.083
## Max. :55.00 Max. :50.00 Max. :33.00 Max. :4.917
## surf stra
## Min. :1.583 Min. :1.250
## 1st Qu.:2.417 1st Qu.:2.625
## Median :2.833 Median :3.188
## Mean :2.787 Mean :3.121
## 3rd Qu.:3.167 3rd Qu.:3.625
## Max. :4.333 Max. :5.000
Most of the variables are roughly normally distributed, except age, which is closer to a Poisson distribution. There are twice as many women as men in the data set. The strongest correlation is between attitude and points.
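As a quick check of the correlation claim, the pairwise correlations of the numeric variables can be computed directly (a sketch, assuming the `analysis2014` data frame read in above):

```r
# Pairwise correlations of the numeric variables; Attitude vs. Points
# should stand out as the strongest pair (assumes analysis2014 from above)
round(cor(analysis2014[, c("Age", "Attitude", "Points", "deep", "surf", "stra")]), 2)
```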
lm1 <- lm(Points~Age+gender+Attitude, data=analysis2014)
summary(lm1)
##
## Call:
## lm(formula = Points ~ Age + gender + Attitude, data = analysis2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.4590 -3.3221 0.2186 4.0247 10.4632
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.42910 2.29043 5.863 2.48e-08 ***
## Age -0.07586 0.05367 -1.414 0.159
## genderM -0.33054 0.91934 -0.360 0.720
## Attitude 0.36066 0.05932 6.080 8.34e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.315 on 162 degrees of freedom
## Multiple R-squared: 0.2018, Adjusted R-squared: 0.187
## F-statistic: 13.65 on 3 and 162 DF, p-value: 5.536e-08
lm2 <- lm(Points~Attitude, data=analysis2014)
summary(lm2)
##
## Call:
## lm(formula = Points ~ Attitude, data = analysis2014)
##
## Residuals:
## Min 1Q Median 3Q Max
## -16.9763 -3.2119 0.4339 4.1534 10.6645
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.63715 1.83035 6.358 1.95e-09 ***
## Attitude 0.35255 0.05674 6.214 4.12e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared: 0.1906, Adjusted R-squared: 0.1856
## F-statistic: 38.61 on 1 and 164 DF, p-value: 4.119e-09
In the model with age, gender and attitude as explanatory variables, only attitude had a significant effect on the exam points; age and gender did not. In the final model, attitude has a significant positive association with points.
par(mfrow=c(2,2))
plot(lm2, which = c(1,2,5))
The diagnostic plots for the second model show no clear pattern in the residuals, so the constant variance assumption holds. The residuals are fairly normally distributed, excluding a few possible outliers, and the leverage plot likewise flags a few points as potential outliers.
The libraries needed in the analyses
library(GGally)
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
library(dplyr)
##
## Attaching package: 'dplyr'
## The following object is masked from 'package:randomForest':
##
## combine
## The following object is masked from 'package:GGally':
##
## nasa
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
We start by reading in the data that was saved after the data wrangling part.
alc <- read.table("data/alc_data.txt", sep="\t", header=TRUE)
colnames(alc)
## [1] "school" "sex" "age" "address" "famsize"
## [6] "Pstatus" "Medu" "Fedu" "Mjob" "Fjob"
## [11] "reason" "nursery" "internet" "guardian" "traveltime"
## [16] "studytime" "failures" "schoolsup" "famsup" "paid"
## [21] "activities" "higher" "romantic" "famrel" "freetime"
## [26] "goout" "Dalc" "Walc" "health" "absences"
## [31] "G1" "G2" "G3" "alc_use" "high_use"
The data consists of different attributes of students from two Portuguese schools. The two questionnaires were partly answered by the same students, and the data sets have been joined based on their answers; only students who took part in both questionnaires are included.
The four variables I think could be linked with high or low alcohol consumption are:
alc_rf <- randomForest(high_use ~ ., data=alc)
## Warning in randomForest.default(m, y, ...): The response has five or fewer
## unique values. Are you sure you want to do regression?
varImpPlot(alc_rf)
ggpairs(alc[,c("high_use", "sex", "absences", "G3", "goout")], aes(col=high_use, alpha=0.3), lower = list(combo = wrap("facethist", bins = 20)))
table(high_use = alc$high_use, sex = alc$sex)
## sex
## high_use F M
## FALSE 156 112
## TRUE 42 72
boxplot(absences~high_use, data=alc, ylab="Number of absences", xlab="High use")
boxplot(G3~high_use, data=alc, ylab="Final grade", xlab="High use")
boxplot(goout~high_use, data=alc, ylab="Number of times going out w/ friends", xlab="High use")
There seem to be proportionally more heavy users among males. Heavy users also have more absences, but their final grades do not differ much. The strongest association appears to be with the number of times going out with friends: the more often a student goes out, the more likely they are a heavy user.
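The claimed sex difference can be read from the row proportions of the cross table (a sketch, assuming the `alc` data frame from above):

```r
# Share of heavy users within each sex (rows sum to 1);
# from the counts above, roughly 0.21 of females and 0.39 of males
# are heavy users
prop.table(table(sex = alc$sex, high_use = alc$high_use), margin = 1)
```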
m1 <- glm(high_use ~ sex + absences + G3 + goout, data = alc, family = "binomial")
summary(m1)
##
## Call:
## glm(formula = high_use ~ sex + absences + G3 + goout, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.9286 -0.8055 -0.5308 0.8264 2.4649
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.57370 0.68720 -5.200 1.99e-07 ***
## sexM 0.95202 0.25496 3.734 0.000188 ***
## absences 0.08215 0.02237 3.672 0.000241 ***
## G3 -0.04453 0.03883 -1.147 0.251571
## goout 0.70844 0.12065 5.872 4.31e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 386.44 on 377 degrees of freedom
## AIC: 396.44
##
## Number of Fisher Scoring iterations: 4
coef(m1)
## (Intercept) sexM absences G3 goout
## -3.57369722 0.95202361 0.08214567 -0.04452530 0.70844342
OR <- coef(m1) %>% exp()
CI <- confint(m1) %>% exp()
## Waiting for profiling to be done...
cbind(OR, CI)
## OR 2.5 % 97.5 %
## (Intercept) 0.02805195 0.006927766 0.1031581
## sexM 2.59094742 1.581095123 4.3051892
## absences 1.08561394 1.040315244 1.1370409
## G3 0.95645140 0.886134749 1.0323272
## goout 2.03082765 1.612303135 2.5901641
From the model summary and the odds ratios with their confidence intervals we can see that sex, the number of absences and the number of times going out with friends are all associated with high alcohol use.
Males are more likely to be heavy users, and the more absences or evenings out with friends a student has, the higher the odds of heavy alcohol use.
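A single coefficient translates into an odds ratio by exponentiation; for example, taking the `goout` estimate from the summary above:

```r
# Each extra point on the going-out scale multiplies the odds of
# high alcohol use by exp(beta); with beta = 0.70844 from the summary
# this matches the 'goout' row of the OR table (~2.03)
exp(0.70844)
```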
m2 <- glm(high_use ~ sex + absences + goout, data = alc, family = "binomial")
summary(m2)
##
## Call:
## glm(formula = high_use ~ sex + absences + goout, family = "binomial",
## data = alc)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.7871 -0.8153 -0.5446 0.8357 2.4740
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.16317 0.47506 -8.764 < 2e-16 ***
## sexM 0.95872 0.25459 3.766 0.000166 ***
## absences 0.08418 0.02237 3.764 0.000167 ***
## goout 0.72981 0.11970 6.097 1.08e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 465.68 on 381 degrees of freedom
## Residual deviance: 387.75 on 378 degrees of freedom
## AIC: 395.75
##
## Number of Fisher Scoring iterations: 4
probabilities <- predict(m2, type = "response")
alc <- mutate(alc, probability = probabilities)
alc <- mutate(alc, prediction = probability > 0.5)
A cross tabulation of our model's predictions against the true values, and a graphical visualization of it:
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table() %>% addmargins()
## prediction
## high_use FALSE TRUE Sum
## FALSE 0.66230366 0.03926702 0.70157068
## TRUE 0.17015707 0.12827225 0.29842932
## Sum 0.83246073 0.16753927 1.00000000
ggplot(alc, aes(y = high_use, x = probability, col=prediction)) + geom_point()
From the cross tabulation we can see that the predictive accuracy is close to 0.80, meaning ~20 % of the students are classified wrongly. Next we compare the model's predictive power to a simple guessing strategy.
loss_func <- function(class, prob) {
n_wrong <- abs(class - prob) > 0.5
mean(n_wrong)
}
loss_func(alc$high_use, prob=0)
## [1] 0.2984293
loss_func(alc$high_use, alc$probability)
## [1] 0.2094241
We can see that the model performed better than simple guessing (~0.21 error vs. ~0.30 error).
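A toy call makes the behaviour of the loss function concrete (using the `loss_func` defined above):

```r
# Predictions 0.4 (for class 1) and 0.6 (for class 0) fall on the
# wrong side of 0.5, so 2 of 4 observations are misclassified -> 0.5
loss_func(class = c(1, 0, 1, 0), prob = c(0.9, 0.2, 0.4, 0.6))
```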
We can get a more reliable error estimate by performing 10-fold cross-validation.
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m2, K = 10)
cv$delta[1]
## [1] 0.2146597
We can see that this model performed better than the one introduced in the DataCamp exercises (0.26 error).
Hooray for that!
Load the data from the MASS package and explore the structure and dimensions
library(dplyr)
library(ggplot2)
library(MASS)
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
data("Boston")
str(Boston)
## 'data.frame': 506 obs. of 14 variables:
## $ crim : num 0.00632 0.02731 0.02729 0.03237 0.06905 ...
## $ zn : num 18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
## $ indus : num 2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
## $ chas : int 0 0 0 0 0 0 0 0 0 0 ...
## $ nox : num 0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
## $ rm : num 6.58 6.42 7.18 7 7.15 ...
## $ age : num 65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
## $ dis : num 4.09 4.97 4.97 6.06 6.06 ...
## $ rad : int 1 2 2 3 3 3 5 5 5 5 ...
## $ tax : num 296 242 242 222 222 222 311 311 311 311 ...
## $ ptratio: num 15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
## $ black : num 397 397 393 395 397 ...
## $ lstat : num 4.98 9.14 4.03 2.94 5.33 ...
## $ medv : num 24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
dim(Boston)
## [1] 506 14
The Boston dataset is a data frame with 506 rows and 14 columns. The columns describe housing values and related variables in the suburbs of Boston.
pairs(Boston)
summary(Boston)
## crim zn indus chas
## Min. : 0.00632 Min. : 0.00 Min. : 0.46 Min. :0.00000
## 1st Qu.: 0.08204 1st Qu.: 0.00 1st Qu.: 5.19 1st Qu.:0.00000
## Median : 0.25651 Median : 0.00 Median : 9.69 Median :0.00000
## Mean : 3.61352 Mean : 11.36 Mean :11.14 Mean :0.06917
## 3rd Qu.: 3.67708 3rd Qu.: 12.50 3rd Qu.:18.10 3rd Qu.:0.00000
## Max. :88.97620 Max. :100.00 Max. :27.74 Max. :1.00000
## nox rm age dis
## Min. :0.3850 Min. :3.561 Min. : 2.90 Min. : 1.130
## 1st Qu.:0.4490 1st Qu.:5.886 1st Qu.: 45.02 1st Qu.: 2.100
## Median :0.5380 Median :6.208 Median : 77.50 Median : 3.207
## Mean :0.5547 Mean :6.285 Mean : 68.57 Mean : 3.795
## 3rd Qu.:0.6240 3rd Qu.:6.623 3rd Qu.: 94.08 3rd Qu.: 5.188
## Max. :0.8710 Max. :8.780 Max. :100.00 Max. :12.127
## rad tax ptratio black
## Min. : 1.000 Min. :187.0 Min. :12.60 Min. : 0.32
## 1st Qu.: 4.000 1st Qu.:279.0 1st Qu.:17.40 1st Qu.:375.38
## Median : 5.000 Median :330.0 Median :19.05 Median :391.44
## Mean : 9.549 Mean :408.2 Mean :18.46 Mean :356.67
## 3rd Qu.:24.000 3rd Qu.:666.0 3rd Qu.:20.20 3rd Qu.:396.23
## Max. :24.000 Max. :711.0 Max. :22.00 Max. :396.90
## lstat medv
## Min. : 1.73 Min. : 5.00
## 1st Qu.: 6.95 1st Qu.:17.02
## Median :11.36 Median :21.20
## Mean :12.65 Mean :22.53
## 3rd Qu.:16.95 3rd Qu.:25.00
## Max. :37.97 Max. :50.00
There are both continuous and binary variables in the data. Some of the variables are highly correlated, either negatively, such as the lower status of the population (lstat) and the median value of owner-occupied homes in $1000s (medv), or positively, such as the full-value property-tax rate per $10,000 (tax) and the proportion of non-retail business acres per town (indus).
However, the variables are not all normally distributed and they do not share the same variance, which are assumptions of LDA, so we need to scale them.
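The two correlation claims can be verified directly, since the Boston data ships with MASS:

```r
library(MASS)   # provides the Boston data
data("Boston")

# lstat vs. medv should be strongly negative,
# tax vs. indus strongly positive
cor(Boston$lstat, Boston$medv)
cor(Boston$tax, Boston$indus)
```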
We scale the values to have a mean of 0 and standard deviation of 1.
boston_scaled <- scale(Boston)
boston_scaled <- as.data.frame(boston_scaled)
apply(boston_scaled,2, sd, na.rm=TRUE)
## crim zn indus chas nox rm age dis rad
## 1 1 1 1 1 1 1 1 1
## tax ptratio black lstat medv
## 1 1 1 1 1
summary(boston_scaled)
## crim zn indus
## Min. :-0.419367 Min. :-0.48724 Min. :-1.5563
## 1st Qu.:-0.410563 1st Qu.:-0.48724 1st Qu.:-0.8668
## Median :-0.390280 Median :-0.48724 Median :-0.2109
## Mean : 0.000000 Mean : 0.00000 Mean : 0.0000
## 3rd Qu.: 0.007389 3rd Qu.: 0.04872 3rd Qu.: 1.0150
## Max. : 9.924110 Max. : 3.80047 Max. : 2.4202
## chas nox rm age
## Min. :-0.2723 Min. :-1.4644 Min. :-3.8764 Min. :-2.3331
## 1st Qu.:-0.2723 1st Qu.:-0.9121 1st Qu.:-0.5681 1st Qu.:-0.8366
## Median :-0.2723 Median :-0.1441 Median :-0.1084 Median : 0.3171
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.:-0.2723 3rd Qu.: 0.5981 3rd Qu.: 0.4823 3rd Qu.: 0.9059
## Max. : 3.6648 Max. : 2.7296 Max. : 3.5515 Max. : 1.1164
## dis rad tax ptratio
## Min. :-1.2658 Min. :-0.9819 Min. :-1.3127 Min. :-2.7047
## 1st Qu.:-0.8049 1st Qu.:-0.6373 1st Qu.:-0.7668 1st Qu.:-0.4876
## Median :-0.2790 Median :-0.5225 Median :-0.4642 Median : 0.2746
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.6617 3rd Qu.: 1.6596 3rd Qu.: 1.5294 3rd Qu.: 0.8058
## Max. : 3.9566 Max. : 1.6596 Max. : 1.7964 Max. : 1.6372
## black lstat medv
## Min. :-3.9033 Min. :-1.5296 Min. :-1.9063
## 1st Qu.: 0.2049 1st Qu.:-0.7986 1st Qu.:-0.5989
## Median : 0.3808 Median :-0.1811 Median :-0.1449
## Mean : 0.0000 Mean : 0.0000 Mean : 0.0000
## 3rd Qu.: 0.4332 3rd Qu.: 0.6024 3rd Qu.: 0.2683
## Max. : 0.4406 Max. : 3.5453 Max. : 2.9865
Then we use the quantile function to break the scaled crime rate into four bins, using the quantiles as category boundaries. We then remove the original crime rate variable and replace it with the new categorical variable.
bins <- quantile(boston_scaled$crim)
crime <- cut(boston_scaled$crim, breaks = bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)
After that we divide the data into train and test sets, with 80 % of the data going to the train set.
n <- nrow(boston_scaled)
ind <- sample(n, size = n * 0.8)
train <- boston_scaled[ind,]
test <- boston_scaled[-ind,]
lda.fit <- lda(crime~., data=train)
classes <- as.numeric(train$crime)
plot(lda.fit, dimen=2, col=classes)
First we save the correct classes of the test set and then remove the crime variable from it. We then predict the crime categories in the test set using the model fitted on the train set and cross-tabulate the predictions with the correct classes.
correct_classes <- test$crime
test <- dplyr::select(test, -crime)
lda.pred <- predict(lda.fit, newdata=test)
table(correct = correct_classes, predicted = lda.pred$class)
## predicted
## correct low med_low med_high high
## low 18 8 3 0
## med_low 3 18 4 0
## med_high 0 7 10 2
## high 0 0 0 29
The model predicted the crime categories quite well. With fewer classes, e.g. three, the result would probably have been even better.
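The overall accuracy can be read off the confusion matrix by rebuilding it from the printed counts (note that the train/test split is random and no seed was set, so the counts vary between runs):

```r
# Confusion matrix from the table above (rows = correct class,
# columns = predicted class, order: low, med_low, med_high, high)
cm <- matrix(c(18,  8,  3,  0,
                3, 18,  4,  0,
                0,  7, 10,  2,
                0,  0,  0, 29), nrow = 4, byrow = TRUE)

# Share of test observations classified correctly (~0.74 here)
sum(diag(cm)) / sum(cm)
```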
data("Boston")
boston_scaled <- scale(Boston)
boston_euk_dist <- dist(boston_scaled)
Then we can run k-means clustering with 3 clusters and visualise the result with a few relevant variables.
km <-kmeans(boston_scaled, centers = 3)
pairs(boston_scaled[,6:10], col = km$cluster)
It seems that 3 clusters might not be the best choice, so we investigate the optimal number of clusters.
set.seed(123)
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled, k)$tot.withinss})
qplot(x = 1:k_max, y = twcss, geom = 'line')
It seems that 2 is the optimal number of clusters, so let's visualise that with the same variables.
km <-kmeans(boston_scaled, centers = 2)
pairs(boston_scaled[,6:10], col = km$cluster)
Looks better.
Reload the dataset and standardize it.
data("Boston")
boston_scaled <- as.data.frame(scale(Boston))
K-means clustering with 5 clusters and LDA on these 5 clusters.
km <- kmeans(boston_scaled, centers=5)
boston_scaled <- mutate(boston_scaled, cluster=km$cluster)
lda_boston <- lda(cluster ~., data=boston_scaled)
Plot the results.
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
heads <- coef(x)
arrows(x0 = 0, y0 = 0,
x1 = myscale * heads[,choices[1]],
y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
text(myscale * heads[,choices], labels = row.names(heads),
cex = tex, col=color, pos=3)
}
plot(lda_boston, dimen=2, col=boston_scaled$cluster)
lda.arrows(lda_boston, myscale = 2, tex=1)
The variables with the biggest impact are tax and rad, which separate clusters 1 and 2 from the others. Of the remaining variables, zn separates clusters 3, 4 and 5 from each other.
library(dplyr)
library(GGally)
human <- read.table("data/human.txt", row.names = 1, header=TRUE, sep="\t")
str(human)
## 'data.frame': 155 obs. of 8 variables:
## $ Pop2EduF : num 97.4 94.3 95 95.5 87.7 96.3 80.5 95.1 100 95 ...
## $ LabFperM : num 0.891 0.819 0.825 0.884 0.829 ...
## $ ExpEdu : num 17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
## $ LifeExp : num 81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
## $ GNI : int 64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
## $ MatMort : int 4 6 6 5 6 7 9 28 11 8 ...
## $ AdolBirth : num 7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
## $ PercInParl: num 39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
The data frame looks as it should.
ggpairs(human)
summary(human)
## Pop2EduF LabFperM ExpEdu LifeExp
## Min. : 0.90 Min. :0.1857 Min. : 5.40 Min. :49.00
## 1st Qu.: 27.15 1st Qu.:0.5984 1st Qu.:11.25 1st Qu.:66.30
## Median : 56.60 Median :0.7535 Median :13.50 Median :74.20
## Mean : 55.37 Mean :0.7074 Mean :13.18 Mean :71.65
## 3rd Qu.: 85.15 3rd Qu.:0.8535 3rd Qu.:15.20 3rd Qu.:77.25
## Max. :100.00 Max. :1.0380 Max. :20.20 Max. :83.50
## GNI MatMort AdolBirth PercInParl
## Min. : 581 Min. : 1.0 Min. : 0.60 Min. : 0.00
## 1st Qu.: 4198 1st Qu.: 11.5 1st Qu.: 12.65 1st Qu.:12.40
## Median : 12040 Median : 49.0 Median : 33.60 Median :19.30
## Mean : 17628 Mean : 149.1 Mean : 47.16 Mean :20.91
## 3rd Qu.: 24512 3rd Qu.: 190.0 3rd Qu.: 71.95 3rd Qu.:27.95
## Max. :123124 Max. :1100.0 Max. :204.80 Max. :57.50
Most of the variables are close to normally distributed, except maternal mortality, GNI and adolescent birth rate, which resemble a Poisson distribution more. There are clear correlations between some of the variables, such as expected education and life expectancy, and maternal mortality and adolescent birth rate. The variance of GNI is very large compared to the other variables.
pca_human <- prcomp(human)
s <- summary(pca_human)
pca_pr <- round(100*s$importance[2, ], digits = 1)
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped
Due to the large variation in GNI, we need to scale the data to get meaningful results.
human_scaled <- scale(human)
pca_human <- prcomp(human_scaled)
s <- summary(pca_human)
pca_pr <- round(100*s$importance[2, ], digits = 1)
pca_pr
## PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8
## 56.1 16.5 9.9 6.8 3.8 2.9 2.5 1.4
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])
The scaling definitely made a difference: the differences between the countries are no longer driven only by differences in GNI.
Most of the variation (PC1, 56.1 %) is explained by the educational status of the population, especially of women, together with life expectancy and GNI, which correlate negatively with maternal mortality and adolescent birth rate. The second axis (PC2, 16.5 %) reflects women's labour force participation and their percentage in parliament.
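The interpretation above can be checked against the loadings (a sketch, assuming the `pca_human` object fitted above; the signs of the loadings may flip between runs, only the relative directions matter):

```r
# PC1: education, life expectancy and GNI should load opposite to
# MatMort and AdolBirth; PC2 should be dominated by LabFperM and PercInParl
round(pca_human$rotation[, 1:2], 2)
```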
library(FactoMineR)
library(tidyr)
data(tea)
str(tea)
## 'data.frame': 300 obs. of 36 variables:
## $ breakfast : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
## $ tea.time : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
## $ evening : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
## $ lunch : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
## $ dinner : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
## $ always : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
## $ home : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
## $ work : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
## $ tearoom : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
## $ friends : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
## $ resto : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
## $ pub : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
## $ Tea : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
## $ How : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
## $ sugar : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
## $ how : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ where : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
## $ price : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
## $ age : int 39 45 47 23 48 21 37 36 40 37 ...
## $ sex : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
## $ SPC : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
## $ Sport : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
## $ age_Q : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
## $ frequency : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
## $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
## $ spirituality : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
## $ healthy : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
## $ diuretic : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
## $ friendliness : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
## $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ feminine : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
## $ sophisticated : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
## $ slimming : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ exciting : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
## $ relaxing : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
## $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
dim(tea)
## [1] 300 36
gather(tea) %>% ggplot(aes(value)) + geom_bar() + facet_wrap("key", scales = "free") + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped
We’ll select a few of the variables for the Multiple Correspondence Analysis.
keep_columns <- c("home", "friends", "Tea", "how", "sex")
tea_time <- dplyr::select(tea, one_of(keep_columns))
mca <- MCA(tea_time, graph = FALSE)
summary(mca)
##
## Call:
## MCA(X = tea_time, graph = FALSE)
##
##
## Eigenvalues
## Dim.1 Dim.2 Dim.3 Dim.4 Dim.5 Dim.6
## Variance 0.269 0.230 0.216 0.199 0.178 0.173
## % of var. 19.191 16.442 15.410 14.216 12.683 12.344
## Cumulative % of var. 19.191 35.633 51.043 65.258 77.941 90.286
## Dim.7
## Variance 0.136
## % of var. 9.714
## Cumulative % of var. 100.000
##
## Individuals (the 10 first)
## Dim.1 ctr cos2 Dim.2 ctr cos2 Dim.3
## 1 | 0.928 1.067 0.598 | 0.114 0.019 0.009 | -0.181
## 2 | 0.494 0.302 0.190 | 0.041 0.002 0.001 | -0.381
## 3 | -0.503 0.313 0.492 | 0.229 0.076 0.102 | -0.126
## 4 | 0.414 0.213 0.183 | 0.608 0.535 0.394 | -0.218
## 5 | 0.414 0.213 0.183 | 0.608 0.535 0.394 | -0.218
## 6 | 0.414 0.213 0.183 | 0.608 0.535 0.394 | -0.218
## 7 | -0.069 0.006 0.007 | 0.303 0.133 0.137 | 0.074
## 8 | 0.494 0.302 0.190 | 0.041 0.002 0.001 | -0.381
## 9 | 0.170 0.036 0.024 | 0.020 0.001 0.000 | -0.016
## 10 | 0.683 0.579 0.271 | -0.474 0.325 0.130 | 0.021
## ctr cos2
## 1 0.051 0.023 |
## 2 0.224 0.113 |
## 3 0.025 0.031 |
## 4 0.073 0.050 |
## 5 0.073 0.050 |
## 6 0.073 0.050 |
## 7 0.008 0.008 |
## 8 0.224 0.113 |
## 9 0.000 0.000 |
## 10 0.001 0.000 |
##
## Categories (the 10 first)
## Dim.1 ctr cos2 v.test Dim.2 ctr
## home | 0.001 0.000 0.000 0.050 | -0.094 0.747
## Not.home | -0.016 0.001 0.000 -0.050 | 3.043 24.141
## friends | -0.434 9.162 0.355 -10.303 | -0.254 3.658
## Not.friends | 0.818 17.266 0.355 10.303 | 0.478 6.894
## black | 0.859 13.545 0.242 8.498 | -0.816 14.261
## Earl Grey | -0.471 10.640 0.401 -10.946 | 0.368 7.583
## green | 0.831 5.650 0.085 5.050 | -0.325 1.009
## tea bag | 0.060 0.149 0.005 1.177 | 0.601 17.765
## tea bag+unpackaged | -0.573 7.669 0.150 -6.698 | -0.809 17.836
## unpackaged | 1.216 13.213 0.202 7.766 | -0.723 5.452
## cos2 v.test Dim.3 ctr cos2 v.test
## home 0.286 -9.255 | -0.110 1.087 0.391 -10.810 |
## Not.home 0.286 9.255 | 3.555 35.143 0.391 10.810 |
## friends 0.121 -6.026 | 0.234 3.330 0.104 5.566 |
## Not.friends 0.121 6.026 | -0.442 6.276 0.104 -5.566 |
## black 0.218 -8.071 | 0.219 1.101 0.016 2.172 |
## Earl Grey 0.245 8.554 | 0.135 1.086 0.033 3.134 |
## green 0.013 -1.975 | -1.281 16.742 0.203 -7.789 |
## tea bag 0.472 11.878 | -0.364 6.954 0.173 -7.194 |
## tea bag+unpackaged 0.299 -9.455 | 0.106 0.324 0.005 1.233 |
## unpackaged 0.071 -4.617 | 1.442 23.146 0.284 9.211 |
##
## Categorical variables (eta2)
## Dim.1 Dim.2 Dim.3
## home | 0.000 0.286 0.391 |
## friends | 0.355 0.121 0.104 |
## Tea | 0.401 0.263 0.204 |
## how | 0.283 0.472 0.328 |
## sex | 0.305 0.008 0.052 |
plot(mca, invisible=c("none"), habillage = "quali")
Based on the MCA, men tend to drink green or black tea, unpackaged, at home and alone.
Read in the data and convert the ID and Group columns to factors.
library(tidyverse)
RATSL <- read.table("data/RATSL.txt", header = TRUE)
RATSL$ID <- as.factor(RATSL$ID)
RATSL$Group <- as.factor(RATSL$Group)
glimpse(RATSL)
## Observations: 176
## Variables: 5
## $ ID <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1...
## $ WD <fct> WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, ...
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, ...
## $ Time <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8...
The data seems to be in order.
On to the analyses.
ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))
As a higher weight in the beginning seems to mean a higher weight in the end as well, standardization is required.
\[\ standardized(x) = \frac{x-mean(x)}{sd(x)}\]
RATSL <- RATSL %>%
group_by(Time) %>%
mutate(stdweight = (Weight-mean(Weight))/sd(Weight)) %>%
ungroup()
ggplot(RATSL, aes(x = Time, y = stdweight, linetype = ID)) +
geom_line() +
scale_linetype_manual(values = rep(1:10, times=4)) +
facet_grid(. ~ Group, labeller = label_both) +
theme(legend.position = "none") +
scale_y_continuous(limits = c(min(RATSL$stdweight), max(RATSL$stdweight)), name = "standardized bprs")
The standard error of the mean can be calculated with the following equation: \[\ se = \frac{sd(x)}{\sqrt{n}} \]
n <- RATSL$Time %>% unique() %>% length()
RATS_trmt <- RATSL %>% group_by(Group, Time) %>% summarize(mean = mean(Weight), se = sd(Weight)/sqrt(n))
glimpse(RATS_trmt)
## Observations: 33
## Variables: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,...
## $ Time <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, ...
## $ mean <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 26...
## $ se <dbl> 4.589478, 3.947710, 3.460116, 4.100800, 3.333956, 3.5529...
ggplot(RATS_trmt, aes(x=Time, y=mean, linetype = Group, shape = Group)) +
geom_line() +
scale_linetype_manual(values = c(1,2,3)) +
geom_point(size=2) +
scale_shape_manual(values = c(1,2,5)) +
geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.5) +
theme(legend.position = c(0.8,0.5), panel.grid = element_blank(), panel.background = element_blank(), panel.border = element_rect(fill = FALSE), legend.key = element_rect(fill=FALSE)) +
scale_y_continuous(name = "mean(Weight) +/- se(Weight)")
ggplot(RATSL, aes(x= as.factor(Time), y=Weight, fill=Group)) +
geom_boxplot() +
theme(legend.position = c(0.8,0.4), panel.grid = element_blank(), panel.background = element_blank(), panel.border = element_rect(fill = FALSE), legend.key = element_rect(fill=FALSE)) +
scale_y_continuous(name = "mean(Weight) +/- se(Weight)") + scale_x_discrete(name = "Time") +
scale_fill_manual(values=c("black", "grey50", "white"))
RATSL8 <- RATSL %>%
group_by(Group, ID) %>%
summarize(mean = mean(Weight)) %>%
ungroup()
ggplot(RATSL8, aes(x=Group, y=mean)) +
geom_boxplot() +
stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
theme(legend.position = c(0.8,0.8), panel.grid = element_blank(), panel.background = element_blank(), panel.border = element_rect(fill = FALSE)) +
scale_y_continuous(name = "mean(Weight) per group")
There’s one outlier in group 2. We remove it and plot the data again.
RATSL8S1 <- filter(RATSL8, (Group=="2" & mean<500) | Group!="2")
ggplot(RATSL8S1, aes(x = Group, y = mean)) +
geom_boxplot() +
theme(legend.position = c(0.8,0.8), panel.grid = element_blank(), panel.background = element_blank(), panel.border = element_rect(fill = FALSE)) +
stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
scale_y_continuous(name = "mean(Weight) per group")
It seems obvious that there are differences between the groups, but let's still test that using analysis of variance.
RATS_lm <- aov(mean ~ Group, data = RATSL8S1)
summary(RATS_lm)
## Df Sum Sq Mean Sq F value Pr(>F)
## Group 2 206390 103195 483.6 3.39e-12 ***
## Residuals 12 2561 213
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As expected, there was a significant difference between the groups. We can run a post hoc test to see which pairwise differences are significant.
TukeyHSD(RATS_lm)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = mean ~ Group, data = RATSL8S1)
##
## $Group
## diff lwr upr p adj
## 2-1 185.73864 159.35462 212.1227 0.00e+00
## 3-1 262.07955 238.21430 285.9448 0.00e+00
## 3-2 76.34091 46.57572 106.1061 4.94e-05
All pairwise comparisons show a clear difference between the groups.
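The Tukey table can also be inspected programmatically: `TukeyHSD()` returns a list with one matrix per factor, and the "p adj" column holds the adjusted pairwise p-values. A minimal sketch on simulated data (not the RATS data itself):

```r
# Illustrative sketch on simulated data, standing in for the RATS analysis.
set.seed(1)
toy <- data.frame(
  y = c(rnorm(10, mean = 0), rnorm(10, mean = 3), rnorm(10, mean = 6)),
  g = factor(rep(c("1", "2", "3"), each = 10))
)
toy_aov <- aov(y ~ g, data = toy)

# One matrix per factor; columns are diff, lwr, upr, "p adj".
tk <- TukeyHSD(toy_aov)$g
tk[, "p adj"] < 0.05   # which pairwise differences are significant
```

With clearly separated group means like these, all three comparisons come out significant, matching what we saw above for the RATS groups.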
Read in the data and convert the treatment and subject columns to factors.
BPRSL <- read.table("data/BPRSL.txt", header = TRUE)
BPRSL$treatment <- as.factor(BPRSL$treatment)
BPRSL$subject <- as.factor(BPRSL$subject)
glimpse(BPRSL)
## Observations: 360
## Variables: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ subject <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ weeks <fct> week0, week0, week0, week0, week0, week0, week0, wee...
## $ bprs <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, ...
## $ week <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...
The data seems to be in order, so let's move on to the analyses.
ggplot(BPRSL, aes(x=week, y=bprs, group=subject, shape=treatment)) +
geom_point() +
theme(legend.position = "top", panel.grid = element_blank(), panel.background = element_blank(),
panel.border = element_rect(fill = FALSE), legend.key = element_rect(fill=FALSE))
ggplot(BPRSL, aes(x = week, y = bprs, fill=subject)) +
geom_line(aes(linetype = treatment)) +
theme(legend.position = "top", panel.grid = element_blank(), panel.background = element_blank(),
panel.border = element_rect(fill = FALSE), legend.key = element_rect(fill=FALSE)) +
scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 2), expand=c(0,0)) + scale_y_continuous(name = "bprs")
First we fit a random intercept model.
library(lme4)
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2748.7 2768.1 -1369.4 2738.7 355
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0481 -0.6749 -0.1361 0.4813 3.4855
##
## Random effects:
## Groups Name Variance Std.Dev.
## subject (Intercept) 47.41 6.885
## Residual 104.21 10.208
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 1.9090 24.334
## week -2.2704 0.2084 -10.896
## treatment2 0.5722 1.0761 0.532
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.437
## treatment2 -0.282 0.000
Then we fit a random intercept and random slope model and test whether it's better than the random intercept only model.
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2745.4 2772.6 -1365.7 2731.4 353
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -2.8919 -0.6194 -0.0691 0.5531 3.7976
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 64.8202 8.0511
## week 0.9608 0.9802 -0.51
## Residual 97.4307 9.8707
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 46.4539 2.1052 22.066
## week -2.2704 0.2977 -7.626
## treatment2 0.5722 1.0405 0.550
##
## Correlation of Fixed Effects:
## (Intr) week
## week -0.582
## treatment2 -0.247 0.000
anova(BPRS_ref, BPRS_ref1)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## BPRS_ref 5 2748.7 2768.1 -1369.4 2738.7
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4 7.2721 2 0.02636 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The random slope model fit better (p = 0.026). Finally, we can try adding an interaction between week and treatment.
BPRS_ref2 <- lmer(bprs ~ week * treatment + (week | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood ['lmerMod']
## Formula: bprs ~ week * treatment + (week | subject)
## Data: BPRSL
##
## AIC BIC logLik deviance df.resid
## 2744.3 2775.4 -1364.1 2728.3 352
##
## Scaled residuals:
## Min 1Q Median 3Q Max
## -3.0512 -0.6271 -0.0767 0.5288 3.9260
##
## Random effects:
## Groups Name Variance Std.Dev. Corr
## subject (Intercept) 65.0016 8.0624
## week 0.9688 0.9843 -0.51
## Residual 96.4699 9.8219
## Number of obs: 360, groups: subject, 20
##
## Fixed effects:
## Estimate Std. Error t value
## (Intercept) 47.8856 2.2522 21.262
## week -2.6283 0.3589 -7.323
## treatment2 -2.2911 1.9090 -1.200
## week:treatment2 0.7158 0.4010 1.785
##
## Correlation of Fixed Effects:
## (Intr) week trtmn2
## week -0.650
## treatment2 -0.424 0.469
## wek:trtmnt2 0.356 -0.559 -0.840
anova(BPRS_ref2, BPRS_ref1)
## Data: BPRSL
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ week * treatment + (week | subject)
## Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
## BPRS_ref1 7 2745.4 2772.6 -1365.7 2731.4
## BPRS_ref2 8 2744.3 2775.4 -1364.1 2728.3 3.1712 1 0.07495 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The interaction did not significantly improve the fit (p = 0.075), so we'll stick with the model without the interaction and plot the observed and fitted values side by side.
library(gridExtra)
##
## Attaching package: 'gridExtra'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:randomForest':
##
## combine
p1 <- ggplot(BPRSL, aes(x = week, y = bprs, fill = subject)) +
geom_line(aes(linetype = treatment)) +
scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 2)) +
scale_y_continuous(name = "Observed bprs") +
theme(panel.grid = element_blank(), panel.background = element_blank(),
panel.border = element_rect(fill = FALSE), legend.key = element_rect(fill=FALSE),
legend.position = "top")
Fitted <- fitted(BPRS_ref1)
BPRSL <- mutate(BPRSL, fitted=Fitted)
p2 <- ggplot(BPRSL, aes(x = week, y = fitted, fill = subject)) +
geom_line(aes(linetype = treatment)) +
scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 2)) +
scale_y_continuous(name = "Fitted bprs") +
theme(panel.grid = element_blank(), panel.background = element_blank(),
panel.border = element_rect(fill = FALSE), legend.key = element_rect(fill=FALSE),
legend.position = "top")
grid.arrange(p1, p2, ncol=2)
That’s it.
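As a final sanity check, the observed and fitted values could also be compared numerically. A minimal sketch using lme4's bundled `sleepstudy` data as a stand-in for BPRSL (same long format, with an analogous random intercept and slope model):

```r
library(lme4)

# Random intercept and slope model, analogous to BPRS_ref1,
# fit to lme4's built-in sleepstudy data.
fit <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy, REML = FALSE)

obs  <- sleepstudy$Reaction
pred <- fitted(fit)

cor(obs, pred)               # how closely the fitted values track the observations
sqrt(mean((obs - pred)^2))   # root mean squared error of the fit
```

A high correlation and a modest RMSE (relative to the scale of the response) suggest the mixed model captures the individual trajectories reasonably well; the same check could be applied to BPRS_ref1 and bprs.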